
    Hierarchical Bin Buffering: Online Local Moments for Dynamic External Memory Arrays

    Local moments are used for local regression, to compute statistical measures such as sums, averages, and standard deviations, and to approximate probability distributions. We consider the case where the data source is a very large I/O array of size n and we want to compute the first N local moments, for some constant N. Without precomputation, this requires O(n) time. We develop a sequence of algorithms of increasing sophistication that use precomputation and additional buffer space to speed up queries. The simpler algorithms partition the I/O array into consecutive ranges called bins, and they are applicable not only to local-moment queries, but also to algebraic queries (MAX, AVERAGE, SUM, etc.). With N buffers of size √n, the time complexity drops to O(√n). A more sophisticated approach uses hierarchical buffering and has logarithmic time complexity, O(b log_b n), when using N hierarchical buffers of size n/b. Using Overlapped Bin Buffering, we show that only a single buffer is needed, as with wavelet-based algorithms, but using much less storage. Applications exist in multidimensional and statistical databases over massive data sets, interactive image processing, and visualization.
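
    As a concrete illustration of the flat (non-hierarchical) variant, here is a minimal Python sketch of bin buffering for SUM range queries, assuming a bin size b = √n; the function names and layout are illustrative, not taken from the paper:

        import math

        def build_bin_buffer(data, b):
            # Buffer holds one precomputed SUM per consecutive bin of size b.
            return [sum(data[i:i + b]) for i in range(0, len(data), b)]

        def range_sum(data, buf, b, lo, hi):
            # Scans at most b raw cells on each end plus n/b buffered bin
            # sums, so b = sqrt(n) yields the O(sqrt(n)) query time above.
            total, i = 0, lo
            while i < hi and i % b != 0:   # partial bin at the front
                total += data[i]
                i += 1
            while i + b <= hi:             # whole bins via the buffer
                total += buf[i // b]
                i += b
            while i < hi:                  # partial bin at the back
                total += data[i]
                i += 1
            return total

        data = list(range(100))
        b = math.isqrt(len(data))
        buf = build_bin_buffer(data, b)
        assert range_sum(data, buf, b, 7, 64) == sum(data[7:64])

    The hierarchical variant replaces this single level of bins with log_b n levels of coarser bins, trading the O(√n) scan for the O(b log_b n) bound.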

    Random sampling with a reservoir

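    No abstract is attached to this entry; for context, the following is a minimal sketch of the classic one-pass reservoir-sampling routine (Algorithm R) associated with this title, written from the textbook description rather than from the paper itself:

        import random

        def reservoir_sample(stream, k, rng=random.Random(0)):
            # Maintains a uniform random sample of k items from a stream of
            # unknown length in a single pass, using O(k) memory.
            reservoir = []
            for i, item in enumerate(stream):
                if i < k:
                    reservoir.append(item)          # fill the reservoir first
                elif (j := rng.randint(0, i)) < k:  # keep item with prob k/(i+1)
                    reservoir[j] = item
            return reservoir

        print(reservoir_sample(range(1000), k=5))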

    When Random Sampling Preserves Privacy

    Abstract. Many organizations such as the U.S. Census publicly release samples of data that they collect about private citizens. These datasets are first anonymized using various techniques and then a small sample is released so as to enable “do-it-yourself” calculations. This paper investigates the privacy of the second step of this process: sampling. We observe that rare values – values that occur with low frequency in the table – can be problematic from a privacy perspective. To our knowledge, this is the first work that quantitatively examines the relationship between the number of rare values in a table and the privacy in a released random sample. If we require ɛ-privacy (where the larger ɛ is, the worse the privacy guarantee) with probability at least 1 − δ, we say that a value is rare if it occurs in at most Õ(1/ɛ) rows of the table (ignoring log factors). If there are no rare values, then we establish a direct connection between the sample size that is safe to release and privacy. Specifically, if we select each row of the table with probability at most ɛ then the sample is O(ɛ)-private with high probability. In the case that there are t rare values, the sample is Õ(ɛδ/t)-private with probability at least 1 − δ.
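
    The mechanism under analysis is plain Bernoulli row sampling; a minimal sketch, assuming the table is held as a list of rows (the names and the toy table are illustrative only):

        import random

        def bernoulli_sample(table, eps, rng=random.Random(0)):
            # Release each row independently with probability eps; per the
            # abstract, with no rare values such a sample is O(eps)-private
            # with high probability.
            return [row for row in table if rng.random() < eps]

        table = [34, 34, 34, 34, 99]   # 99 occurs once: a rare value
        print(bernoulli_sample(table, eps=0.5))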

    X-Stream: Edge-centric Graph Processing using Streaming Partitions

    X-Stream is a system for processing both in-memory and out-of-core graphs on a single shared-memory machine. While retaining the scatter-gather programming model with state stored in the vertices, X-Stream is novel in (i) using an edge-centric rather than a vertex-centric implementation of this model, and (ii) streaming completely unordered edge lists rather than performing random access. This design is motivated by the fact that sequential bandwidth for all storage media (main memory, SSD, and magnetic disk) is substantially larger than random-access bandwidth. We demonstrate that a large number of graph algorithms can be expressed using the edge-centric scatter-gather model. The resulting implementations scale well in terms of number of cores, in terms of number of I/O devices, and across different storage media. X-Stream competes favorably with existing systems for graph processing. Besides sequential access, we identify the fact that X-Stream does not need to sort edge lists during pre-processing as one of the main contributors to its better performance.
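
    A minimal sketch of the edge-centric scatter-gather pattern, using connected components as the example algorithm; this is an in-memory toy that mirrors only the programming model, not X-Stream's streaming-partition machinery:

        def edge_centric_cc(num_vertices, edges, max_iters=100):
            # State lives in the vertices; each pass streams the unordered
            # edge list sequentially instead of doing random vertex access.
            label = list(range(num_vertices))
            for _ in range(max_iters):
                # Scatter: one update per edge endpoint, read in edge order.
                updates = [(dst, label[src]) for src, dst in edges]
                updates += [(src, label[dst]) for src, dst in edges]
                # Gather: apply updates to the target vertices' state.
                changed = False
                for v, lbl in updates:
                    if lbl < label[v]:
                        label[v], changed = lbl, True
                if not changed:
                    break
            return label

        print(edge_centric_cc(5, [(0, 1), (1, 2), (3, 4)]))  # [0, 0, 0, 3, 3]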

    Lavoisier: A Low Altitude Balloon Network for Probing the Deep Atmosphere and Surface of Venus

    The in-situ exploration of the low atmosphere and surface of Venus is clearly the next step of Venus exploration. Understanding the geochemistry of the low atmosphere, interacting with rocks, and the way the integrated Venus system evolved, under the combined effects of inner-planet cooling and an intense atmospheric greenhouse, is a major challenge of modern planetology. Due to the dense atmosphere (95 bars at the surface), balloon platforms offer an interesting means to transport and land in-situ measurement instruments. Due to the large Archimedes (buoyancy) force, a 2 cubic meter He-pressurized balloon floating at 10 km altitude may carry up to 60 kg of payload. LAVOISIER is a project submitted to ESA in 2000, in the follow-up and spirit of the balloon deployed at cloud level by the Russian Vega mission in 1986. It is composed of a descent probe, for detailed noble gas and atmosphere composition analysis, and of a network of 3 balloons for geochemical and geophysical investigations at local, regional, and global scales.
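
    The quoted lift figure can be sanity-checked with the ideal gas law; a rough back-of-the-envelope in Python, where the ~47 bar / ~650 K ambient conditions near 10 km altitude are assumed reference values, not numbers from the abstract:

        R = 8.314                      # J/(mol*K), gas constant
        P, T = 47e5, 650.0             # assumed conditions near 10 km altitude
        M_CO2, M_He = 0.044, 0.004     # molar masses, kg/mol

        rho_atm = P * M_CO2 / (R * T)  # CO2 atmosphere density
        rho_he = P * M_He / (R * T)    # He fill at the same P and T
        V = 2.0                        # balloon volume, m^3

        print(f"gross lift: {V * (rho_atm - rho_he):.0f} kg")  # ~70 kg

    which leaves roughly the stated 60 kg for payload once the envelope's own mass is subtracted.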

    Fading histograms in detecting distribution and concept changes

    The remarkable number of real applications under dynamic scenarios is driving a novel ability to generate and gather information. Nowadays, a massive amount of information is generated at a high-speed rate, known as data streams. Moreover, data are collected under evolving environments. Due to memory restrictions, data must be promptly processed and discarded immediately. Therefore, dealing with evolving data streams raises two main questions: (i) how to remember discarded data? and (ii) how to forget outdated data? To maintain an updated representation of the time-evolving data, this paper proposes fading histograms. Regarding the dynamics of nature, changes in data are detected through a windowing scheme that compares data distributions computed by the fading histograms: the adaptive cumulative windows model (ACWM). The online monitoring of the distance between data distributions is evaluated using a dissimilarity measure based on the asymmetry of the Kullback–Leibler divergence. The experimental results support the ability of fading histograms to provide an updated representation of data. Such a property works in favor of detecting distribution changes with a smaller detection delay when compared with standard histograms. With respect to the detection of concept changes, the ACWM is compared with 3 known algorithms from the literature, using artificial data and public data sets, presenting better results. Furthermore, the proposed method was extended to multidimensional data, and the experiments performed show the ability of the ACWM to detect distribution changes in these settings.
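
    A minimal sketch of the two ingredients the abstract names, a fading histogram and a Kullback–Leibler dissimilarity between two histograms; the decay factor alpha and the symmetrized distance are illustrative stand-ins, since the ACWM's exact windowing and asymmetry-based measure are specified in the paper:

        import math

        def fade_and_update(hist, bin_idx, alpha=0.999):
            # Decay every count by alpha, then credit the new observation's
            # bin, so outdated data is gradually forgotten.
            for b in range(len(hist)):
                hist[b] *= alpha
            hist[bin_idx] += 1.0

        def kl(p, q, eps=1e-12):
            # KL(p || q) over normalized histograms; eps guards empty bins.
            sp, sq = sum(p), sum(q)
            return sum((x / sp) * math.log((x / sp + eps) / (y / sq + eps))
                       for x, y in zip(p, q) if x > 0)

        def dissimilarity(p, q):
            # Simple symmetrization as a change signal; the paper's measure
            # instead exploits the asymmetry of KL between the two windows.
            return 0.5 * (kl(p, q) + kl(q, p))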
